<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html><head><title>Programming in Lua : 29.2</title>
<link rel="stylesheet" type="text/css" href="pil.css">
</head>
<body>
<p class="warning">
<A HREF="contents.html"><IMG TITLE="Programming in Lua (first edition)" SRC="capa.jpg" ALT="" ALIGN="left"></A>This first edition was written for Lua 5.0. While still largely relevant for later versions, there are some differences.<BR>The third edition targets Lua 5.2 and is available at <A HREF="http://www.amazon.com/exec/obidos/ASIN/859037985X/theprogrammil3-20">Amazon</A> and other bookstores.<BR>By buying the book, you also help to <A HREF="../donations.html">support the Lua project</A>.
</p>
<table width="100%" class="nav">
<tr>
<td width="10%" align="left"><a href="29.1.html"><img border="0" src="left.png" alt="Previous"></a></td>
<td width="80%" align="center"><a href="contents.html"><font face="Helvetica,Arial,sanserif">
<font color="gray">Programming in </font><font color="blue"> Lua</font>
</font></a></td>
<td width="10%" align="right"></td>
</tr>
<tr>
<td width="10%" align="left"></td>
<td width="80%" align="center"><a href="contents.html#P4">Part IV. The C API</a>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<a href="contents.html#29">Chapter 29. Managing Resources</a></td>
<td width="10%" align="right"></td></tr>
</table>
<hr/>
<p><h2>29.2 &ndash; An XML Parser</h2>

<p>Now we will look at a simplified implementation of <code>lxp</code>,
a binding between Lua and Expat.
Expat is an open source XML 1.0 parser written in C.
It implements SAX, the <em>Simple API for XML</em>.
SAX is an event-based API.
That means that a SAX parser reads an XML document and,
as it goes,
reports to the application what it finds, through callbacks.
For instance,
if we instruct Expat to parse a string like
<pre>
    &lt;tag cap="5">hi&lt;/tag>
</pre>
it will generate three events:
a <em>start-element</em> event,
when it reads the substring <code>"&lt;tag cap="5">"</code>;
a <em>text</em> event (also called a <em>character data</em> event),
when it reads <code>"hi"</code>;
and an <em>end-element</em> event, when it reads <code>"&lt;/tag>"</code>.
Each of these events calls an appropriate
<em>callback handler</em> in the application.

<p>Here we will not cover the entire Expat library.
We will concentrate only on those parts that illustrate
new techniques for interacting with Lua.
It is easy to add bells and whistles later,
after we have implemented this core functionality.
Although Expat handles more than a dozen different events,
we will consider only the three events that we saw
in the previous example (start elements, end elements, and text).
The part of the Expat API that we need for this example is small.
First, we need functions to create and destroy an Expat parser:
<pre>
    #include &lt;xmlparse.h>
    
    XML_Parser XML_ParserCreate (const char *encoding);
    void XML_ParserFree (XML_Parser p);
</pre>
The argument <code>encoding</code> is optional;
we will use <code>NULL</code> in our binding.

<p>After we have a parser,
we must register its callback handlers:
<pre>
    XML_SetElementHandler(XML_Parser p,
                          XML_StartElementHandler start,
                          XML_EndElementHandler end);
    
    XML_SetCharacterDataHandler(XML_Parser p,
                                XML_CharacterDataHandler hndl);
</pre>
The first function registers handlers for start and end elements.
The second function registers handlers for text
(<em>character data</em>, in XML parlance).

<p>All callback handlers receive
some user data as their first parameter.
The start-element handler receives also the tag name
and its attributes:
<pre>
    typedef void (*XML_StartElementHandler)(void *uData,
                                            const char *name,
                                            const char **atts);
</pre>
The attributes come as a NULL-terminated array of strings,
where each pair of consecutive strings holds an attribute name
and its value.
The end-element handler has only one extra parameter, the tag name:
<pre>
    typedef void (*XML_EndElementHandler)(void *uData,
                                          const char *name);
</pre>
Finally, a text handler receives only the text as an extra parameter.
This text string is not null-terminated;
instead, it has an explicit length:
<pre>
    typedef void
    (*XML_CharacterDataHandler)(void *uData,
                                const char *s,
                                int len);
</pre>

<p>To feed text to Expat, we use the following function:
<pre>
    int XML_Parse (XML_Parser p,
                   const char *s, int len, int isFinal);
</pre>
Expat receives the document to be parsed in pieces,
through successive calls to <code>XML_Parse</code>.
The last argument to <code>XML_Parse</code>, <code>isFinal</code>,
informs Expat whether that piece is the last one of a document.
Notice that each piece of text does not need to be zero terminated;
instead, we supply an explicit length.
The <code>XML_Parse</code> function returns zero if it detects a parse error.
(Expat provides auxiliary functions to retrieve error information,
but we will ignore them here, for the sake of simplicity.)

<p>The last function we need from Expat allows us to set
the user data that will be passed to the handlers:
<pre>
    void XML_SetUserData (XML_Parser p, void *uData);
</pre>

<p>Now let us have a look at how we can use this library in Lua.
A first approach is a direct approach:
Simply export all those functions to Lua.
A better approach is to adapt the functionality to Lua.
For instance, because Lua is untyped, we do not need different
functions to set each kind of callback.
Better yet, we can avoid the callback registering functions altogether.
Instead, when we create a parser,
we give a callback table that contains all callback handlers,
each with an appropriate key.
For instance, if we only want to print a layout of a document,
we could use the following callback table:
<pre>
    local count = 0
    
    callbacks = {
      StartElement = function (parser, tagname)
        io.write("+ ", string.rep("  ", count), tagname, "\n")
        count = count + 1
      end,
    
      EndElement = function (parser, tagname)
        count = count - 1
        io.write("- ", string.rep("  ", count), tagname, "\n")
      end,
    }
</pre>
Fed with the input <code>"&lt;to> &lt;yes/> &lt;/to>"</code>,
those handlers would print
<pre>
    + to
    +   yes
    -   yes
    - to
</pre>
With this API, we do not need functions to manipulate callbacks.
We manipulate them directly in the callback table.
Thus, the whole API needs only three functions:
one to create parsers, one to parse a piece of text,
and one to close a parser.
(Actually, we will implement the last two functions as
methods of parser objects.)
A typical use of the API could be like this:
<pre>
    p = lxp.new(callbacks)     -- create new parser
    for l in io.lines() do     -- iterate over input lines
      assert(p:parse(l))               -- parse the line
      assert(p:parse("\n"))            -- add a newline
    end
    assert(p:parse())        -- finish document
    p:close()
</pre>

<p><pre>

</pre>

<p>Now let us turn our attention to the implementation.
The first decision is how to represent a parser in Lua.
It is quite natural to use a userdatum,
but what do we need to put inside it?
At least, we must keep the actual Expat parser
and the callback table.
We cannot store a Lua table inside a userdatum
(or inside any C structure);
however, we can create a reference to the table
and store the reference inside the userdatum.
(Remember from <a href="27.3.2.html#references">Section 27.3.2</a> that a reference
is a Lua-generated integer key in the registry.)
Finally, we must be able to store a Lua state into a parser object,
because these parser objects is all that an Expat callback receives from
our program, and the callbacks need to call Lua.
Therefore, the definition for a parser object is as follows:
<pre>
    #include &lt;xmlparse.h>
    
    typedef struct lxp_userdata {
      lua_State *L;
      XML_Parser *parser;          /* associated expat parser */
      int tableref;   /* table with callbacks for this parser */
    } lxp_userdata;
</pre>

<p>The next step is the function that creates parser objects.
Here it is:
<pre>
    static int lxp_make_parser (lua_State *L) {
      XML_Parser p;
      lxp_userdata *xpu;
    
      /* (1) create a parser object */
      xpu = (lxp_userdata *)lua_newuserdata(L,
                                       sizeof(lxp_userdata));
    
      /* pre-initialize it, in case of errors */
      xpu->tableref = LUA_REFNIL;
      xpu->parser = NULL;
    
      /* set its metatable */
      luaL_getmetatable(L, "Expat");
      lua_setmetatable(L, -2);
    
      /* (2) create the Expat parser */
      p = xpu->parser = XML_ParserCreate(NULL);
      if (!p)
        luaL_error(L, "XML_ParserCreate failed");
    
      /* (3) create and store reference to callback table */
      luaL_checktype(L, 1, LUA_TTABLE);
      lua_pushvalue(L, 1);  /* put table on the stack top */
      xpu->tableref = luaL_ref(L, LUA_REGISTRYINDEX);
    
      /* (4) configure Expat parser */
      XML_SetUserData(p, xpu);
      XML_SetElementHandler(p, f_StartElement, f_EndElement);
      XML_SetCharacterDataHandler(p, f_CharData);
      return 1;
    }
</pre>
The <code>lxp_make_parser</code> function has four main steps:
<ul>

<p><li>
Its first step follows a common pattern:
It first creates a userdatum;
then it pre-initializes the userdatum with consistent values;
and finally sets its metatable.
The reason for the pre-initialization is subtle:
If there is any error during the initialization,
we must make sure that the finalizer (the <code>__gc</code> metamethod)
will find the userdata in a consistent state.

<p><li>
In step 2, the function creates an Expat parser,
stores it in the userdatum, and checks for errors.

<p><li>
Step 3 ensures that the first argument to the function
is actually a table (the callback table),
creates a reference to it,
and stores the reference into the new userdatum.

<p><li>
The last step initializes the Expat parser.
It sets the userdatum as the object to be passed to callback functions
and it sets the callback functions.
Notice that these callback functions are the same for all parsers;
after all, it is impossible to dynamically create new functions in C.
Instead, these fixed C functions will use the callback table to
decide which Lua functions they should call each time.

<p></ul>

<p>The next step is the <code>parse</code> method,
which parses a piece of XML data.
It gets two arguments:
The parser object (the <em>self</em> of the method)
and an optional piece of XML data.
When called without any data,
it informs Expat that the document has no more parts:
<pre>
    static int lxp_parse (lua_State *L) {
      int status;
      size_t len;
      const char *s;
      lxp_userdata *xpu;
    
      /* get and check first argument (should be a parser) */
      xpu = (lxp_userdata *)luaL_checkudata(L, 1, "Expat");
      luaL_argcheck(L, xpu, 1, "expat parser expected");
    
      /* get second argument (a string) */
      s = luaL_optlstring(L, 2, NULL, &amp;len);
    
      /* prepare environment for handlers: */
      /* put callback table at stack index 3 */
      lua_settop(L, 2);
      lua_getref(L, xpu->tableref);
      xpu->L = L;  /* set Lua state */
    
      /* call Expat to parse string */
      status = XML_Parse(xpu->parser, s, (int)len, s == NULL);
    
      /* return error code */
      lua_pushboolean(L, status);
      return 1;
    }
</pre>
When <code>lxp_parse</code> calls <code>XML_Parse</code>,
the latter function will call the handlers for each relevant
element that it finds in the given piece of document.
Therefore, <code>lxp_parse</code> first prepares an environment
for these handlers.
There is one more detail in the call to <code>XML_Parse</code>:
Remember that the last argument to this function
tells Expat whether the given piece of text is the last one.
When we call <code>parse</code> without an argument <code>s</code> will be <code>NULL</code>,
so this last argument will be true.

<p>Now let us turn our attention to the callback functions
<code>f_StartElement</code>, <code>f_EndElement</code>, and <code>f_CharData</code>.
All those three functions have a similar structure:
Each checks whether the callback table defines a Lua
handler for its specific event and, if so, prepares the
arguments and then calls that Lua handler.

<p>Let us first see the <code>f_CharData</code> handler.
Its code is quite simple.
It calls its corresponding handler in Lua (when present)
with only two arguments:
the parser and the character data (a string):
<pre>
    static void f_CharData (void *ud, const char *s, int len) {
      lxp_userdata *xpu = (lxp_userdata *)ud;
      lua_State *L = xpu->L;
    
      /* get handler */
      lua_pushstring(L, "CharacterData");
      lua_gettable(L, 3);
      if (lua_isnil(L, -1)) {  /* no handler? */
        lua_pop(L, 1);
        return;
      }
    
      lua_pushvalue(L, 1);  /* push the parser (`self') */
      lua_pushlstring(L, s, len);  /* push Char data */
      lua_call(L, 2, 0);  /* call the handler */
    }
</pre>
Notice that all these C handlers receive a
<code>lxp_userdata</code> structure as their first argument,
due to our call to <code>XML_SetUserData</code> when
we create the parser.
Also notice how it uses the environment set by <code>lxp_parse</code>.
First, it assumes that the callback table is at stack index 3.
Second, it assumes that the parser itself is at stack index 1
(it must be there,
because it should be the first argument to <code>lxp_parse</code>).

<p>The <code>f_EndElement</code> handler is also simple
and quite similar to <code>f_CharData</code>.
It also calls its corresponding Lua handler with two arguments:
the parser and the tag name (again a string, but now null-terminated):
<pre>
    static void f_EndElement (void *ud, const char *name) {
      lxp_userdata *xpu = (lxp_userdata *)ud;
      lua_State *L = xpu->L;
    
      lua_pushstring(L, "EndElement");
      lua_gettable(L, 3);
      if (lua_isnil(L, -1)) {  /* no handler? */
        lua_pop(L, 1);
        return;
      }
    
      lua_pushvalue(L, 1);  /* push the parser (`self') */
      lua_pushstring(L, name);  /* push tag name */
      lua_call(L, 2, 0);  /* call the handler */
    }
</pre>

<p>The last handler, <code>f_StartElement</code>,
calls Lua with three arguments:
the parser, the tag name, and a list of attributes.
This handler is a little more complex than the others,
because it needs to translate the tag's list of attributes into Lua.
We will use a quite natural translation.
For instance, a start tag like
<pre>
    &lt;to method="post" priority="high">
</pre>
generates the following table of attributes:
<pre>
    { method = "post", priority = "high" }
</pre>
The implementation of <code>f_StartElement</code> follows:
<pre>
    static void f_StartElement (void *ud,
                                const char *name,
                                const char **atts) {
      lxp_userdata *xpu = (lxp_userdata *)ud;
      lua_State *L = xpu->L;
    
      lua_pushstring(L, "StartElement");
      lua_gettable(L, 3);
      if (lua_isnil(L, -1)) {  /* no handler? */
        lua_pop(L, 1);
        return;
      }
    
      lua_pushvalue(L, 1);  /* push the parser (`self') */
      lua_pushstring(L, name);  /* push tag name */
    
      /* create and fill the attribute table */
      lua_newtable(L);
      while (*atts) {
        lua_pushstring(L, *atts++);
        lua_pushstring(L, *atts++);
        lua_settable(L, -3);
      }
    
      lua_call(L, 3, 0);  /* call the handler */
    }
</pre>

<p>The last method for parsers is <code>close</code>.
When we close a parser, we have to free all its resources,
namely the Expat structure and the callback table.
Remember that, due to occasional errors during its creation,
a parser may not have these resources:
<pre>
    static int lxp_close (lua_State *L) {
      lxp_userdata *xpu;
    
      xpu = (lxp_userdata *)luaL_checkudata(L, 1, "Expat");
      luaL_argcheck(L, xpu, 1, "expat parser expected");
    
      /* free (unref) callback table */
      luaL_unref(L, LUA_REGISTRYINDEX, xpu->tableref);
      xpu->tableref = LUA_REFNIL;
    
      /* free Expat parser (if there is one) */
      if (xpu->parser)
        XML_ParserFree(xpu->parser);
      xpu->parser = NULL;
      return 0;
    }
</pre>
Notice how we keep the parser in a consistent
state as we close it,
so there is no problem if we try to close it again
or when the garbage collector finalizes it.
Actually, we will use exactly this function as the finalizer.
That ensures that every parser eventually frees its resources,
even if the programmer does not close it.

<p>The final step is to open the library,
putting all those parts together.
We will use here the same scheme that we used in the
object-oriented array example (<a href="28.3.html#oo-array">Section 28.3</a>):
We will create a metatable,
put all methods inside it,
and make its <code>__index</code> field point to itself.
For that, we need a list with the parser methods:
<pre>
    static const struct luaL_reg lxp_meths[] = {
      {"parse", lxp_parse},
      {"close", lxp_close},
      {"__gc", lxp_close},
      {NULL, NULL}
    };
</pre>
We also need a list with the functions of this library.
As is common with OO libraries, this library has a single function,
which creates new parsers:
<pre>
    static const struct luaL_reg lxp_funcs[] = {
      {"new", lxp_make_parser},
      {NULL, NULL}
    };
</pre>
Finally, the open function must create the metatable,
make it point to itself (through <code>__index</code>),
and register methods and functions:
<pre>
    int luaopen_lxp (lua_State *L) {
      /* create metatable */
      luaL_newmetatable(L, "Expat");
    
      /* metatable.__index = metatable */
      lua_pushliteral(L, "__index");
      lua_pushvalue(L, -2);
      lua_rawset(L, -3);
    
      /* register methods */
      luaL_openlib (L, NULL, lxp_meths, 0);
    
      /* register functions (only lxp.new) */
      luaL_openlib (L, "lxp", lxp_funcs, 0);
      return 1;
    }
</pre>

<p><hr/>
<table width="100%" class="nav">
<tr valign="top">
<td width="90%" align="left">
<small>
  Copyright &copy; 2003&ndash;2004 Roberto Ierusalimschy.  All rights reserved.
</small>
</td>
<td width="10%" align="right"></td>
</tr>
</table>


</body></html>

